Abstract


As parallel systems move into the production 
scientific-computing world, the emphasis will 
be on cost-effective solutions that provide high 
throughput for a mix of applications. Cost- 
effective solutions demand that a system make 
effective use of all of its resources. Many MIMD
multiprocessors today, however, distinguish be- 
tween “compute” and “I/O” nodes, the latter 
having attached disks and being dedicated to 
running the file-system server. This static division
of responsibilities simplifies system management
but does not necessarily lead to the best
performance in workloads that need a different 
balance of computation and I/O. 

Of course, computational processes sharing a 
node with a file-system service may receive less 
CPU time, network bandwidth, and memory 
bandwidth than they would on a computation- 
only node. In this paper we begin to examine 
this issue experimentally. We found that high- 
performance I/O does not necessarily require 
substantial CPU time, leaving plenty of time for 
application computation. There were some com- 
plex file-system requests, however, which left lit- 
tle CPU time available to the application. (The 
impact on network and memory bandwidth still
needs to be determined.) For applications (or 
users) that cannot tolerate an occasional inter- 
ruption, we recommend that they continue to use 
only compute nodes. For tolerant applications 
needing more cycles than those provided by the
compute nodes, we recommend that they take
full advantage of both compute and I/O nodes for
computation, and that operating systems should 
make this possible. 

1 Introduction 

Programmers of scientific computer applications 
are increasingly turning to parallel systems for 
their production computing. In today’s climate 
of tightening budgets, however, their managers 
demand cost-effective solutions that provide high 
throughput for a mix of applications. Several
applications, each with different computational 
and I/O needs, are simultaneously active within 
a single multiprocessor. Cost-effective solutions 
demand that a system make effective use of all 
of its resources. 

Many MIMD multiprocessors today are config- 
ured with two distinct types of processor nodes: 
those that have disks attached, which are ded- 
icated to file I/O, and those that do not have 
disks attached, which are used for running ap- 
plications. This static division of responsibilities 
simplifies system management but does not nec- 
essarily lead to the best performance in work- 
loads that need a different balance of compu- 
tation and I/O. For example, a system which 
makes all nodes available to computational applications
increases its overall computational power
and may therefore be more cost effective. 

Computational processes running on nodes 
that also serve part of the file system, however, 
may receive less CPU time, network bandwidth, 
and memory bandwidth than they would on a 
computation-only node. The conventional wis- 
dom is that the CPU overhead of the file-system 
code running on I/O nodes, coupled with the 
unpredictable and erratic nature of I/O activ- 
ity, would substantially disrupt the performance 
of computational applications. In this paper we
examine this issue experimentally, focusing on
the impact of a file-system server on the CPU
time available to local computational processes.
We found that high-performance I/O does not
necessarily require substantial CPU time, leaving
plenty of time for application computation.
There were some complex file-system requests,
however, which left little CPU time available
to the application. (The impact on network
and memory bandwidth still needs to be determined.)
For applications (or users) which cannot
tolerate an occasional interruption, we recommend
that they continue to use only compute
nodes. For other applications, particularly those
that can adapt to changing load, we recommend
that they consider taking full advantage of both
compute and I/O nodes for computation. After
all, our results show that the I/O nodes usually
had cycles to spare. 

We begin in the next section with background 
information about multiprocessor file systems. 
Section 3 describes some simulations and their 
results and Section 4 describes some measure- 
ments on a real system. We summarize our con- 
clusions in Section 5. 

2 Background 

There are many different parallel file systems
[Kri94, Pie89, FPD93, Roy93, LIN+93, DdR92, 
CF94, Dib90, DSE88, MS94, HdC95, HER+95]. 
Most, though not all, are designed for ma- 
chines that have dedicated I/O nodes. Most 
are based on a fairly traditional Unix-like interface,
in which individual processes make a request
to the file system for each piece of the file
they read or write. Increasingly common, however,
are specialized interfaces to support multidimensional
matrices [CFPB93, SW94, GL91,
GGL93, BdC93, BBS+94, Mas92, SCJ+95], and
interfaces that support collective I/O [GGL93,
BdC93, BBS+94, Mas92]. With a collective-I/O
interface, all processes make a single joint request
to the file system, rather than numerous
independent requests.

Disk-directed I/O is a promising new technique
that takes advantage of a collective-I/O
interface, and leads to much better performance
than file systems based on traditional caching
strategies [Kot94]. With disk-directed I/O, compute
nodes make a collective request to the file
system, which forwards the request to all I/O
nodes. Each I/O node examines the request to
determine which file blocks are on its disks, sorts 
the file blocks by physical location to produce 
an efficient schedule, and then begins a series of 
transfers according to the schedule. In effect, 
the I/O nodes are in charge of the data transfer, 
which is organized to best suit the disks’ per- 
formance characteristics. Each I/O node uses 
two buffers to overlap disk transfer and network 
transfer. For example, when reading, one buffer 
is filled by reading a block from disk while an- 
other buffer is emptied by scattering its contents 
among the compute-node memories according to 
the requested distribution. Data transfers be- 
tween compute nodes and I/O nodes use low- 
overhead “Memput” and “Memget” messages 
that move data directly to and from the appli- 
cation buffer. The experiments in [Kot94] show 
that disk-directed I/O obtains nearly the peak 
disk bandwidth across many data distributions 
and system configurations. 
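
The scheduling step that each I/O node performs can be sketched as follows. This is only an illustration of the idea described above, not the code evaluated in [Kot94]: the names `build_schedule`, `my_disk`, `num_disks`, and `physical_location` are hypothetical, and the sketch assumes round-robin striping with one disk per I/O node.

```python
def build_schedule(requested_blocks, my_disk, num_disks, physical_location):
    """Return the file blocks this I/O node should transfer, in disk order.

    requested_blocks:  logical file-block numbers named by the collective request
    my_disk:           index of the disk attached to this I/O node (assumed one per node)
    physical_location: function mapping a logical block to its position on disk
    """
    # Keep only the blocks that are striped onto this node's disk
    # (round-robin striping is an assumption of this sketch).
    mine = [b for b in requested_blocks if b % num_disks == my_disk]
    # Sort by physical position so the disk can service them efficiently.
    return sorted(mine, key=physical_location)


# Hypothetical layout in which the file's blocks sit in reverse order on each disk.
schedule = build_schedule(range(64), my_disk=3, num_disks=16,
                          physical_location=lambda b: 1000 - b // 16)
print(schedule)
```

In a fuller implementation the node would then walk this schedule with its two buffers, filling one from disk while the other is drained to the compute nodes.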

There have been no previous studies of 
CPU activity on the I/O nodes of multiprocessors.
A ten-year-old study of diskless workstations
[LZCZ86] found that file-server CPU
load can be extremely high. To be able to pro- 
vide high performance during periods of intense 
I/O activity, however, a balanced multiprocessor 
spreads its disks across many I/O nodes so that
the I/O-node CPUs will not be a performance
bottleneck. This configuration leaves open the
possibility that the I/O nodes will be underutilized
during other periods.

3 Simulation Experiments 

We wanted to measure the worst-case impact of
unpredictable I/O interruptions on a computa- 
tional application, so we devised an experiment 
involving two 16-processor applications on a 32- 
node multiprocessor, in which one application 
did nothing but I/O, and the other did noth- 
ing but computation. The I/O application ei- 
ther read or wrote a file that was striped across 
disks attached to the computational application’s 
processors. Thus, the computational application
was occasionally interrupted so that the file system
could service I/O requests for the other application.
These interruptions slowed the computational
application in two ways. First, every
cycle spent servicing the I/O request was
another cycle delay for the interrupted application.
Second, delaying one process in the computational
application indirectly delayed other
processes that waited for the process at a future 
synchronization point [MCD+91]. 

In our experiments we used two different kinds 
of computational applications, 36 different kinds
of I/O applications, and two different kinds of file
systems, all on a parallel file-system simulator.

3.1 Computational applications 

Our two computational applications did nothing
but computation. The first application, designed
to measure the effect of interruptions on
raw computational performance, had no synchronization
or other communication between
processes. The second application was designed
to measure the effect of load imbalance caused
by I/O-related interruptions, by having all processes
meet at a barrier every 1 msec of virtual
time. With no interruptions, all processes would
meet at every barrier at precisely the same physical
times, and thus would never wait. An interruption
of the computation on one processor,
however, delayed both that process and all other
processes that had to wait for it at the next barrier.
Thus, a small perturbation of the execution
time of one process could have a ripple effect that
was much larger than the original.

We chose to use barriers because they have the 
most drastic effects on performance if the pro- 
cessors become unbalanced: all processes must 
wait for the slowest process. Similarly we chose 
a tight 1 msec interval to represent a challeng- 
ing case (several NASA benchmarks on the Intel 
Paragon and an SGI cluster were measured with 
inter-barrier times of 6, 17, or 64 msec [Nit94]). 

Note that our barrier experiment also represents
a computational application that is running
on many processors, only some of which are
involved in serving I/O, while others are left to
run at full speed. All other things being equal,
those without I/O interruptions will always have
to wait for those with I/O interruptions. If those
slow processors run at 95% of full speed, then
the whole application runs at 95% of full speed,
regardless of the number of uninterrupted pro- 
cessors. 
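
The ripple effect of barriers can be illustrated with a small model. This is a simplified sketch and not the simulator used in our experiments; the function names and the delay values are made up for illustration. Each process computes for a fixed interval between barriers, and `delays[k][p]` is the interruption time suffered by process `p` in interval `k`.

```python
def efficiency_with_barriers(interval, delays):
    """Fraction of elapsed time spent computing when all processes
    synchronize at a barrier after every interval."""
    useful = interval * len(delays)
    # A barrier releases only when the slowest process arrives.
    elapsed = sum(interval + max(step) for step in delays)
    return useful / elapsed


def efficiency_without_barriers(interval, delays):
    """Same measure when processes never synchronize: the run ends
    when the process with the largest total delay finishes."""
    useful = interval * len(delays)
    per_process = [sum(step[p] for step in delays) for p in range(len(delays[0]))]
    return useful / (useful + max(per_process))


# Hypothetical example: four 1-msec intervals, four processes, and a
# different process interrupted for 0.05 msec in each interval.
delays = [
    [0.05, 0.0, 0.0, 0.0],
    [0.0, 0.05, 0.0, 0.0],
    [0.0, 0.0, 0.05, 0.0],
    [0.0, 0.0, 0.0, 0.05],
]
with_barriers = efficiency_with_barriers(1.0, delays)
without_barriers = efficiency_without_barriers(1.0, delays)
print(with_barriers, without_barriers)
```

With barriers every interruption is paid by all processes, so the efficiency is lower than in the unsynchronized case even though the total interruption time is identical.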

3.2 I/O applications 

Our I/O applications did nothing but I/O. They
each transferred a one- or two-dimensional array
of records, but in either case the file size was
10 MB (1280 8-KB blocks). While 10 MB is not a
large file, preliminary tests showed qualitatively
similar results with 100 and 1000 MB files. Thus,
10 MB was a compromise to save simulation
time. The file was striped, block by block, across
the 16 disks attached to the computational application’s
processors. The matrix was distributed
across the 16 memories of the I/O application according
to one of the HPF distributions [HPF93],
as shown in Figure 1. Each matrix element was
either 8 bytes or 8 Kbytes. Clearly, patterns that
use 8-byte elements and a column-cyclic distribution
lead to a fine-grained data distribution,
and typically to more I/O overhead.
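
The chunk sizes quoted in Figure 1 can be reproduced with a short sketch. The code below is illustrative only: the function names are hypothetical, the processor numbering and the square processor grid used when both dimensions are distributed are assumptions, and it handles only sizes that divide evenly.

```python
from math import isqrt


def owner_1d(index, extent, procs, dist):
    """Processor coordinate that owns `index` along one dimension."""
    if dist == "NONE":
        return 0
    if dist == "BLOCK":
        return index // (extent // procs)  # assumes procs divides extent
    if dist == "CYCLIC":
        return index % procs
    raise ValueError(dist)


def proc_grid(row_dist, col_dist, procs):
    """Number of processors along each dimension (assumed arrangement)."""
    if row_dist == "NONE" and col_dist == "NONE":
        return 1, 1
    if col_dist == "NONE":
        return procs, 1
    if row_dist == "NONE":
        return 1, procs
    side = isqrt(procs)  # assumes a square grid when both dimensions are distributed
    return side, side


def owners(rows, cols, procs, row_dist, col_dist):
    """Owner of every element, listed in row-major (file) order."""
    pr, pc = proc_grid(row_dist, col_dist, procs)
    return [owner_1d(r, rows, pr, row_dist) * pc + owner_1d(c, cols, pc, col_dist)
            for r in range(rows) for c in range(cols)]


def chunk_size(owner_list):
    """Longest run of consecutive elements destined for one processor."""
    best = run = 1
    for prev, cur in zip(owner_list, owner_list[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


for name, rd, cd in [("rbn", "BLOCK", "NONE"), ("rcn", "CYCLIC", "NONE"),
                     ("rbb", "BLOCK", "BLOCK"), ("rbc", "BLOCK", "CYCLIC")]:
    print(name, chunk_size(owners(8, 8, 4, rd, cd)))
```

A one-dimensional vector can be treated as a matrix with a single row, for example `owners(1, 8, 4, "NONE", "CYCLIC")`.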

3.3 File-system implementations 

HPF array-distribution patterns (panels of Figure 1)

  1x8 vector
    NONE (rn)              cs = 8
    BLOCK (rb)             cs = 2
    CYCLIC (rc)            cs = 1, s = 4
  8x8 matrix
    NONE, NONE (rnn)       cs = 64
    BLOCK, NONE (rbn)      cs = 16
    CYCLIC, NONE (rcn)     cs = 8, s = 32
    NONE, BLOCK (rnb)      cs = 2, s = 8
    BLOCK, BLOCK (rbb)     cs = 4, s = 8
    CYCLIC, BLOCK (rcb)    cs = 4, s = 16
    NONE, CYCLIC (rnc)     cs = 1, s = 4
    BLOCK, CYCLIC (rbc)    cs = 1, s = 2
    CYCLIC, CYCLIC (rcc)   cs = 1, s = 2, 10

Figure 1: Examples of matrix distributions, which we used as file-access patterns in our experiments.
These examples represent common ways to distribute a 1x8 vector or an 8x8 matrix over
four processors. Patterns are named by the distribution method (NONE, BLOCK, or CYCLIC)
in each dimension (rows first, in the case of matrices). Each region of the matrix is labeled with
the number of the compute node responsible for that region. The matrix is stored in row-major
order, both in the file and in memory. The chunk size (cs) is the size of the largest contiguous
chunk of the file that is sent to a single compute node (in units of array elements), and the stride
(s) is the file distance between the beginning of one chunk and the next chunk destined for the
same compute node, where relevant.

The file accessed by the I/O applications was
striped across all 16 disks. Within each disk the
blocks of the file were laid out contiguously, that
is, the logical blocks of the file were laid out in
consecutive physical blocks on disk. We chose
this layout because it provides the highest I/O
throughput, thus keeping the file-system code
the most busy. Any other layout would transfer
data more slowly, requiring interruptions less
often.
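
The striped, contiguous layout described above can be written down as a simple address mapping. This is a sketch under the assumption that blocks are dealt to the disks in round-robin order starting at disk 0; the function name `locate` is hypothetical.

```python
BLOCK_SIZE = 8 * 1024          # file-system block size in bytes
NUM_DISKS = 16
FILE_SIZE = 10 * 1024 * 1024   # 10 MB file


def locate(file_offset):
    """Map a byte offset in the file to (disk, block on that disk, offset in block)."""
    block = file_offset // BLOCK_SIZE
    disk = block % NUM_DISKS            # round-robin striping (assumed)
    block_on_disk = block // NUM_DISKS  # contiguous layout within each disk
    return disk, block_on_disk, file_offset % BLOCK_SIZE


num_blocks = FILE_SIZE // BLOCK_SIZE
blocks_per_disk = num_blocks // NUM_DISKS
print(num_blocks, blocks_per_disk, locate(17 * BLOCK_SIZE + 5))
```

Because consecutive blocks of the file on one disk are also consecutive on the platter, a sequential scan of the file keeps every disk streaming with no seeks between blocks.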

We modeled two different file systems: traditional
caching and disk-directed I/O. Traditional
caching was meant to simulate a typical parallel
file system where compute nodes, on behalf
of application processes, made independent requests
to the appropriate I/O nodes. Each application
request to a compute node was for some
contiguous range of bytes in the file, but because
the file was striped by blocks, each compute-node
request to an I/O node could be for at most
one block. The I/O nodes each maintained a
block cache, with LRU replacement and support
for prefetching and write-behind. The I/O node
was multithreaded, with a new thread created for
each incoming request. Threads shared a data
structure describing the LRU buffer list, blocking
when waiting for a buffer to be flushed for
re-use, or for a buffer to be filled with new data
from disk. This choice led to a clean design with
plenty of concurrency, at the cost of some thread-switching
overhead. More importantly, the distribution
of I/O-request service times was highly
variable, depending on whether the request was a
cache hit or miss, whether a free buffer could
easily be located, and so forth.
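
The source of this variability can be seen in even a minimal LRU block cache. The sketch below is not the simulated file system: it omits prefetching, write-behind, and threads, and the class name `BlockCache` and the `read_block` callback are hypothetical. It shows only that a hit is a cheap lookup while a miss may require an eviction and a disk read.

```python
from collections import OrderedDict


class BlockCache:
    """Minimal LRU block cache (no prefetching, write-behind, or threads)."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        self.read_block = read_block  # hypothetical callback: block number -> data
        self.blocks = OrderedDict()   # least recently used entry is first
        self.hits = 0
        self.misses = 0

    def get(self, block):
        if block in self.blocks:
            # Cache hit: cheap, just refresh the block's position in LRU order.
            self.hits += 1
            self.blocks.move_to_end(block)
            return self.blocks[block]
        # Cache miss: expensive, may need to evict and must read from disk.
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        data = self.read_block(block)
        self.blocks[block] = data
        return data


cache = BlockCache(capacity=2, read_block=lambda b: f"data{b}")
for b in [0, 1, 0, 2, 1]:
    cache.get(b)
print(cache.hits, cache.misses)
```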

Disk-directed I/O is a new technique that 
takes advantage of a collective-I/O interface, and
leads to much better performance than traditional
caching [Kot94]. As described above, it
works by giving control over the order and pace
of data transfer to the I/O nodes, which optimize
the transfer for maximum disk performance. After
an initial burst of CPU activity to determine
the disk schedules, the only ongoing CPU
overhead is to compute the distribution of each
block’s data among the compute-node memories.
When reading, for example, some blocks coming
off of disk must be split into several smaller
pieces, which are sent to the remote compute- 
node memories. Some distributions involve sub- 
stantial computations to determine the ultimate 
location of each element. 
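
To see why fine-grained distributions are costly, consider how one 8-KB block is divided into contiguous pieces, each of which must be sent to a compute node. The following sketch is illustrative only: `split_block` is a hypothetical helper, and the example assumes 16 compute nodes with elements dealt cyclically in the fastest-varying dimension.

```python
from itertools import groupby


def split_block(owners):
    """Group a block's elements into contiguous pieces.

    owners[i] is the compute node that should receive element i of the block.
    Returns a list of (node, first_element, element_count) tuples.
    """
    pieces = []
    start = 0
    for node, run in groupby(owners):
        count = len(list(run))
        pieces.append((node, start, count))
        start += count
    return pieces


BLOCK_BYTES = 8 * 1024
NODES = 16  # compute nodes receiving data (as in our experiments)

# 8-byte records dealt cyclically: neighboring elements never share an owner.
fine = split_block([i % NODES for i in range(BLOCK_BYTES // 8)])
# A single 8-KB record: the whole block goes to one node.
coarse = split_block([0])
print(len(fine), len(coarse))
```

Under these assumptions the fine-grained case yields one piece per element, so each block costs over a thousand tiny transfers where the coarse case needs one.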


3.4 Measurement methodology 

Rather than actually running a computational 
application, we measured the fraction of CPU 
time available for running a computational ap- 
plication on one set of processors, during the 
period the I/O application was running on the
other set of processors. Before and after the I/O
application ran, of course, there were no interruptions
and so the computational application
received 100% of the CPU’s time; since we were
interested in the effect of the I/O requests, we
only measured the period when the I/O application
was running. Note that this methodology
means that the I/O interruptions had priority
over the computation; again, this experiment
was designed to expose the worst-case effects on
the computational application.

To make this measurement, we collected traces
of the CPU activity on the I/O nodes of our two
file systems, under load from one of the I/O applications.
We processed the traces to count idle
cycles as a proportion of total cycles (i.e., the
complement of the CPU utilization). However, not
all idle cycles would be available to a real computation,
due to the overhead for switching context
between the application and the file system.
For each interruption, therefore, we deducted
50 μsec.¹ Idle intervals shorter than 50 μsec were
therefore useless to the computation, and so were
not counted.
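
The trace processing just described can be sketched as follows. This is a simplified reconstruction, not the actual analysis script: the function name, the trace format (a list of busy intervals in microseconds), and the decision to charge the context-switch cost to every idle gap, including the first and last, are assumptions.

```python
CONTEXT_SWITCH_USEC = 50


def usable_fraction(busy_intervals, total_usec, switch_usec=CONTEXT_SWITCH_USEC):
    """Fraction of the measured period usable by a computation.

    busy_intervals: (start, end) times in usec during which the file system
                    occupied the CPU; total_usec is the length of the period.
    """
    usable = 0
    cursor = 0  # end of the most recent busy interval seen so far
    for start, end in sorted(busy_intervals):
        # Idle gap before this interruption, minus the context-switch cost;
        # gaps shorter than the switch time contribute nothing.
        usable += max(0, start - cursor - switch_usec)
        cursor = max(cursor, end)
    usable += max(0, total_usec - cursor - switch_usec)
    return usable / total_usec


# Hypothetical trace: two interruptions in a 1000-usec window.
fraction = usable_fraction([(100, 200), (230, 300)], total_usec=1000)
print(fraction)
```

In this example the 30-usec gap between the two interruptions is discarded entirely, which is the effect described in the paragraph above.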

3.5 Simulator 

Our traces were collected from the STARFISH
parallel file-system simulator [Kot94], which
ran on top of the Proteus parallel-architecture
simulator [BDCW91], which in turn ran on
a DEC-5000 workstation. Proteus itself has
been validated against real message-passing machines
[BDCW91]. We configured Proteus using
the parameters listed in Table 1. These parameters
are not meant to reflect any particular machine,
but a generic machine of current technology.

¹This is a moderate context-switch time [ALBL91],
even when cache effects are considered. In any case, pre- 
liminary experiments showed that our results were not 
sensitive to this parameter. 

Table 1: Parameters for simulator.

MIMD, distributed-memory     32 processors
Compute processors           16
I/O processors               16
CPU speed, type              50 MHz, RISC
Disks                        16
Disk type                    HP 97560
Disk capacity                1.3 GB
Disk peak transfer rate      2.34 Mbytes/s
File-system block size       8 KB
I/O buses (one per IOP)      16
I/O bus type                 SCSI
I/O bus peak bandwidth       10 Mbytes/s
Interconnect topology        6x6 torus
Interconnect bandwidth       200 x 10^6 bytes/s, bidirectional
Interconnect latency         20 ns per router
Routing                      wormhole

We added a disk model, a reimplementation of 
Ruemmler and Wilkes’ HP 97560 model [RW94, 
KTR94]. We validated our model against disk 
traces provided by HP, using the same technique 
and measure as Ruemmler and Wilkes. Our im- 
plementation had a demerit percentage of 3.9%, 
which indicates that it modeled the 97560 accu- 
rately. 

3.6 Results 

Figure 2 compares the impact of all 36 I/O ap- 
plications on our first computational application, 
as well as showing the I/O bandwidth achieved 
by the I/O application. Ideally, all points 
would be in the upper-right corner, indicating 
high I/O throughput and computational performance.
Most of the disk-directed-I/O points
are there, except for six “hard” patterns on the 
left. Traditional caching had much poorer I/O 
performance, and its CPU needs were slightly 
smaller (to some extent the CPU needs appear 
smaller because the CPU impact was spread over 
a longer physical time, due to the poor I/O per- 
formance). 


To get a better understanding of Figure 2, 
we selected two representative patterns for more 
detailed presentation: one that was extremely 
easy and fast in both file systems, and another 
that was extremely complex and slow in both 
file systems. The easy pattern (representing 
points in the upper right) distributed a one-dimensional
matrix of 8-KB records cyclically
among the memories (recall that 8 KB was the 
file-system block size). The hard pattern (rep- 
resenting points in the lower left) distributed a 
two-dimensional matrix of 8-byte records among 
the memories in a BLOCK-CYCLIC layout, to 
use HPF terminology. We look at both the read 
and write versions of these two patterns, for a 
total of four cases.²

Table 2 shows the results in detail for each 
of these four access patterns and each file sys- 
tem. The “easy” access patterns took little CPU 
time, leaving 90-95% of the CPU for the com- 
putational application. Nonetheless, they sus- 
tained 32-33 MB/s, which is 86-89% of the disks’ 
peak bandwidth. Of the two file systems, disk- 
directed I/O had higher I/O throughput and less 
CPU demand. 
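
The percentage of peak bandwidth follows from the disk parameters in Table 1. The short calculation below is a sketch that assumes the aggregate peak is simply the per-disk peak times the number of disks, and that the "MB" and "Mbytes" units in the text and tables are the same; the variable names are ours.

```python
NUM_DISKS = 16
DISK_PEAK_MB_S = 2.34  # peak transfer rate of one HP 97560 (Table 1)

aggregate_peak = NUM_DISKS * DISK_PEAK_MB_S


def percent_of_peak(throughput_mb_s):
    """Measured throughput as a percentage of the aggregate disk peak."""
    return 100.0 * throughput_mb_s / aggregate_peak


# Throughput of the easy read under traditional caching (Table 2).
easy_read_tc = percent_of_peak(32.2)
print(round(aggregate_peak, 2), round(easy_read_tc, 1))
```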

For the “hard” access patterns, however, the 
situation was quite different. I/O performance 
suffered, in traditional caching because it managed
the disks and cache poorly, and in disk-directed
I/O because of the amount of CPU overhead
in handling thousands of 8-byte messages.³
Nonetheless, this example points out a situation 
where the I/O benefits of disk-directed I/O were 
enormous. It came at a cost, however, in terms 
of the amount of CPU overhead required, which 
in the worst case left only 3.4% of the CPU cy- 
cles available for the computational application. 
The CPU overhead of traditional caching does 
not seem to be so bad, but this was again par- 
tially due to the poor I/O performance spreading 
out the overhead over many cycles. 

²In [Kot94], the easy patterns are called rc and wc
with 8-KB records, and the hard patterns are called rbc
and wbc with 8-byte records.

³We suspect the latter may be improved with a
gather/scatter message-passing mechanism.

Table 2: Percent of CPU time available to the computational application (100% is ideal), and the
amount of data throughput achieved by the I/O application.

                   Traditional Caching               Disk-directed I/O
              CPU available   I/O throughput    CPU available   I/O throughput
                (percent)       (MBytes/s)        (percent)       (MBytes/s)
easy read         95              32.2              95
easy write        90              32.4              95
hard read         60               2.2               3.4
hard write        87               0.7               5.1

Table 3: A comparison of the amount of CPU time usable by the computation, with and without
barrier synchronization. In the presence of load imbalance caused by I/O interruptions, barriers
cause some processors to idle, reducing the percentage of CPU that was “usable.”

                Traditional Caching          Disk-directed I/O
                 CPU available (%)           CPU available (%)
              no barriers    barriers     no barriers    barriers
easy read         95            92            95            93
easy write        90            85            95            92
hard read         60             3.2           3.4           2.0
hard write        87             1.6           5.1           2.3

When we added barrier synchronizations to
the computational application, the I/O activity
of course had a bigger effect. Figure 3 plots the
effect of all 36 access patterns on this synchro- 
nizing application. Table 3 focuses on the same 
representative cases as before. First, note that 
there was only minimal effect on the easy ac- 
cess patterns. The interruptions were short and 
rare, leading to little disturbance. On the “hard” 
patterns in the traditional-caching file system, 
however, there was a dramatic effect due to the 
highly variable amount of computation needed 
for cache-management operations (for example, 
a cache miss took much more computation than 
a cache hit), leading to load imbalance within 
the computational application. 

Figure 2: I/O throughput vs. computational performance for all 36 different access patterns, and
both file-system implementations. The upper-right corner represents the best cases; there are
actually 41 points above 30 MB/s, many of which overlap in this picture.

Figure 3: Similar to Figure 2, but with a computational application that includes a barrier
synchronization every 1 msec of virtual time. Again, many of the points in the upper right
overlap.

4 Measurement Experiments

The simulations in the previous section allowed 
us to examine the effects of a variety of workloads
on two very different file systems in a controlled
setting. To support these results, we have
also measured the effects of a real file system 
on a real computation, using a cluster of eight 
IBM RS/6000-250 workstations in Dartmouth’s 
FLEET lab.⁴ We used a LINPACK benchmark
program as a computational application. We ran 
several copies of this program in parallel, one on 
each of six workstations. Each process ran 10 
iterations of the LINPACK computation, stop- 
ping for a barrier at 16 points within each it- 
eration (on average, after every half second of 
computation).⁵ Needless to say, this synthetic
parallel application is perfectly load balanced. 
Then, we had one of the other two workstations 
run a simple program that either read or wrote 
a 400 MB file with 1 KB requests, sequentially 
or randomly, where the file was served through 
NFS from one of the hosts running the LIN- 
PACK program. Due to the periodic barriers, 
any slowdown experienced by that node caused 
the entire application to slow down. (As a con- 
trol, we ran a similar test with six workstations 
running the LINPACK program while the other 
two did I/O, one as client and one as server; de- 
spite the network traffic, the I/O had no effect on 
the LINPACK program’s barriers.) While this 
experiment does not directly correspond to any 
of the patterns used in Section 3, it is slightly
harder than the “easy” pattern examined there. 

Table 4 presents the results. Although we cannot
fully explain the differences in the effects of
the I/O access patterns, it is clear that the ap- 
plication was able to run at 50-85% efficiency 
despite the CPU impact of the I/O. Faster pro- 
cessors, which would be found in any substantial 
parallel machine, should experience even less im- 
pact. Given the heavyweight nature of this op- 
erating system and the NFS file system, these 
results corroborate those in the previous section. 
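
The efficiency figures in Table 4 are the ideal ("No I/O") execution time divided by the measured time. The snippet below recomputes them from the times reported in the table; the dictionary and variable names are ours.

```python
# Execution times in seconds, as reported in Table 4.
times = {
    "No I/O": 89.2,
    "Sequential read": 177.4,
    "Random read": 113.9,
    "Sequential write": 105.0,
    "Random write": 146.2,
}

ideal = times["No I/O"]
# Efficiency is the ideal time relative to the measured time, in percent.
efficiency = {name: round(100.0 * ideal / t, 1)
              for name, t in times.items() if name != "No I/O"}

for name, value in efficiency.items():
    print(f"{name}: {value}%")
```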


⁴For more information see http://www.cs.dartmouth.edu/research/fleet/.

⁵We used MPI [Wal94] for the communication support.


Table 4: Execution time of a synthetic parallel
computation, in seconds. In the “No I/O” case,
this application runs alone, and represents the
ideal execution time for this application. In the
other cases one of the nodes is burdened with
heavy NFS traffic. “Efficiency” represents the
performance relative to the ideal execution time.

                    Time (sec)   Efficiency
No I/O                 89.2
Sequential read       177.4        50.3%
Random read           113.9        78.3%
Sequential write      105.0        85.0%
Random write          146.2        61.0%


5 Discussion and conclusions 

Large multiprocessors with many processors and 
disks have great potential for fast computations 
and high I/O throughput. Due to their cost, 
however, it is important to use their resources 
efficiently. To provide the high-performance I/O 
needed by some applications, many multiproces- 
sors today dedicate a subset of their nodes to 
I/O. Our results show that for some complex 
file-request patterns, these dedicated nodes were
saturated. For many simpler patterns, however,
the I/O-node CPUs were largely idle, that is,
with 80-99% available that could be used for
running applications. Furthermore, even applications
that synchronized at a barrier every millisecond
could profitably obtain about 80-97%
of the I/O node’s CPU time for computation. 
Disk-directed I/O usually needed less CPU time 
than a traditional caching file system. Measure- 
ment results from a real file system on a cluster 
of workstations corroborated these results. 

Please note that our specific experimental re- 
sults are dependent on the simulated and real ar- 
chitectures and workloads that we used. Indeed, 
real multiprocessor configurations will have a 
different balance between CPU speed and disk 
speed, a different mix of “easy” and “hard” work- 
loads, and different ratios of compute nodes, I/O 
nodes, and disks. Given a similar workload, sys- 
tems with fewer I/O nodes or slower I/O-node 
CPUs will of course appear to have busier CPUs.

No matter what configuration, however, we ex- 
pect that the fundamental conclusion remains: 
for any fixed configuration there will likely be
periods when the I/O-node CPUs are underutilized
while some applications are CPU-bound,
and periods when the I/O-node CPUs are fully
utilized. The system should thus be configured
with sufficient I/O nodes to sustain the heaviest
I/O load, but the operating and run-time systems
should be flexible enough to allow tolerant
applications to use I/O-node CPUs when available.

This paper should only be considered a start- 
ing point, as we have only considered the im- 
pact of I/O service on the CPU utilization of 
an I/O node. File-I/O traffic may also substantially
impact the communication performance of
a computation-only application [BBH95]. File-system
activity will also compete with a computation
for memory bandwidth and cache
space. Finally, efficient system software would 
be needed to provide the flexibility that we pro- 
pose. Nonetheless, we feel that the issue is worth 
further exploration. An implementation, and ex- 
perimentation with a real workload, are neces- 
sary. 